Vol. 30 ISMB 2014, pages i96-i104 
doi:10. 1093/bioinformatics/btu262 



Large scale analysis of signal reachability 

Andrei Todor'', Haitham Gabr, Alin Dobra and Tamer Kahveci 

CISE Department, University of Florida, Gainesville, FL 3261 1 , USA 



ABSTRACT 

Motivation: Major disorders, such as leukemia, have been shown to 
alter the transcription of genes. Understanding how gene regulation is 
affected by such aberrations is of utmost importance. One promising 
strategy toward this objective is to compute whether signals can reach 
to the transcription factors through the transcription regulatory net- 
work (TRN). Due to the uncertainty of the regulatory interactions, this 
is a #P-complete problem and thus solving it for very large TRNs 
remains to be a challenge. 

Results: We develop a novel and scalable method to compute the 
probability that a signal originating at any given set of source genes 
can arrive at any given set of target genes (i.e., transcription factors) 
when the topology of the underlying signaling network is uncertain. 
Our method tackles this problem for large networks while providing a 
provably accurate result. Our method follows a divide-and-conquer 
strategy. We break down the given network into a sequence of non- 
overlapping subnetworks such that reachability can be computed au- 
tonomously and sequentially on each subnetwork. We represent each 
interaction using a small polynomial. The product of these polynomials 
express different scenarios when a signal can or cannot reach to 
target genes from the source genes. We introduce polynomial collap- 
sing operators for each subnetwork. These operators reduce the size 
of the resulting polynomial and thus the computational complexity 
dramatically. We show that our method scales to entire human regu- 
latory networks in only seconds, while the existing methods fail 
beyond a few tens of genes and interactions. We demonstrate that 
our method can successfully characterize key reachability character- 
istics of the entire transcriptions regulatory networks of patients 
affected by eight different subtypes of leukemia, as well as those 
from healthy control samples. 

Availability: All the datasets and code used in this article are available 
at bioinformatics.cise.ufl.edu/PReach/scalable.htm. 
Contact: atodor@cise.ufl.edu 

Supplementary information: Supplementary data are available at 
Bioinformatics online. 



1 INTRODUCTION 

Major disorders, such as cancer, have been shown to aUer the 
transcription of a large number of genes and thus affect the 
mechanism that governs cells functions (Krivtsov, 2009; Valk 
et aL, 2004). Many complex disorders, such as acute lympho- 
blastic leukemias, however, yield a varying spectrum of expres- 
sion profiles and, as a result, cannot be robustly characterized by 
merely studying the gene expressions (Armstrong, 2002). 

An important part of cell biology research is the study of the 
causal relationship between extracellular conditions and the cell 
response. Such causality is governed by a chain of biochemical 
reactions through which extracellular signals are transmitted 
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from membrane receptors to transcription factors (i.e., reporters) 
via protein-protein interactions (Bu and Callaway, 2011). While 
the pattern of this mechanism is similar for all organisms, im- 
portant variations in its quantitative aspects such as gene expres- 
sions result from external perturbations, differentiation stage 
of the cell, timing of DNA replication and various epigenetic 
mutations (Los et aL, 2009; Mattick et aL, 2009). Therefore, de- 
tecting these quantitative variations is an important source of 
information for assessing the fitness of the organism and ultim- 
ately for diagnosis and prognosis. 

Extensive evidence suggests that there is a degree of uncer- 
tainty in our knowledge of interactions within cells (Bader 
et aL, 2004; Ceol et aL, 2010; Deng et aL, 2003; Ourfah et aL, 
2007; Sharan et aL, 2002; Suthram et aL, 2006; Szklarczyk et aL, 
2011; von Mering et aL, 2002). The source of this uncertainty is 
2-fold. First, the biological processes that are modeled as protein 
interactions in biological networks are stochastic events (Bader 
et aL, 2004). Second, the evidence in support of an interaction is 
not entirely decisive for the actual presence of the interaction 
(Bader et aL, 2004; Ourfah et aL, 2007; Sharan et aL, 2002; 
Shlomi et aL, 2006) due to many reasons, such as epigenetic 
variations across different cells (Gerstein et aL, 2012). Several 
schemes have already been proposed to assess the rehabihty of 
protein interactions in the form of confidence values (Bader 
et aL, 2004; Deng et aL, 2003; Suthram et aL, 2006). Such inter- 
action confidence values are now available in large biological 
network databases, such as MINT (Ceol et aL, 2010) and 
STRING (Szklarczyk et aL, 2011). 

Recent studies often model the uncertainty of the interactions 
in biological networks using probabilistic networks (Gabr et aL, 
2013; Todor et aL, 2012; Todor et aL, 2013). We adopt the same 
model in this article, namely, each node of the network denotes a 
gene and the directed edge from a node V/ to node Vj denotes that 
the gene corresponding to v, can regulate the gene denoted by Vj 
through an interaction. Each edge in this network is a probabil- 
istic event. That is, it is considered possible, but not certain, 
reflecting the insecure knowledge of the gene regulation process. 
A common way to model the uncertainty of each edge is to 
associate it with a probabihty value, which is computed for 
each interaction from several factors: gene expressions, available 
evidence for it and network topology around it (Sharan et aL, 
2002). 

The abihty to compute confidence values for interactions pro- 
vides opportunities to model and study biological networks ac- 
curately. It, however, comes at a high price as the uncertainty of 
the topology of interactions makes studying biological networks 
a computationally challenging task. The challenge is that a prob- 
abihstic network represents a large number of alternative deter- 
ministic network topologies. More precisely, a network with n 
probabihstic edges yields 2" possible network configurations, as 
each one of the n edges may be present or absent. For instance, in 
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Fig. 1. A probabilistic network (top) and two of the deterministic net- 
works corresponding to it (bottom). Each of the deterministic networks is 
obtained from the probabilistic network with some probability deter- 
mined by the probabilities of the edges that are included or not in the 
deterministic network, pi denotes the probability of edge being present. 
qi = 1- Pi is the probability of the edge being absent. The expression 
above each deterministic network is the probability of observing it 

Figure 1 , the probabihstic network shown on top corresponds to 
16 deterministic networks since it contains 4 probabilistic edges. 

In this article, we address the problem of characterizing the 
signaling reachability in transcription regulatory networks 
(TRNs). Unlike most of the existing literature, we eliminate the 
limitations of the classical assumption that all interactions are de- 
terministic and adopt the more descriptive probabilistic network. 
More specifically, given a set of source genes ^= {^i, ^2, • • • , ^^a) 
and a set of target genes T= {ti, t2, . . . , tjj], we compute the 
reachability profile of that network as a doubly indexed vector 
R where, for all /, j such that 1 <i<a,\ <j<b, the entry R[iJ\ 
is the probability that a signal originating at Sj can reach tj (i.e., Si 
regulates tj). We show that the reachabihty profile can help us 
understand how different disorders that alter the cellular func- 
tions based on the signaling patterns of the gene regulatory net- 
works. We particularly focus on leukemias, which is challenging 
due to the heterogeneity of the transcription patterns. 

Summary of related work. The problem of computing reach- 
abihty in uncertain network topologies has drawn significant 
attention in the context of network reliability. Various exact 
methods, as well as approximate methods, have been proposed. 
We refer interested readers to several surveys on the topic 
(Aggarwal et al., 1975; Hwang et aL, 1981). Theoretical results 
on the complexity of the problem reveal that it is #P complete 
(Brown and Colbourn, 1996; Husfeldt and Taslaman, 2010; 
Provan and Ball, 1983). The problem is significantly simphfied 
in the case of acychc graphs. This type of graphs can be repre- 
sented as Bayesian Networks, for which various inference 
algorithms exist. However, for this simple case sophisticated in- 
ference algorithms are unnecessary. In the context of biological 
networks, the problem for general graphs was first addressed by 
Ourfah et al. (2007). The goal of these authors was to infer the 
structure of the signaling network that best explains a set of gene 
knockout pairs, given a protein-protein interaction network. To 
achieve this goal, they developed a method to compute the reach- 
abihty probability for each knockout gene pair. Their method is 
an exact solution based on the inclusion-exclusion principle (van 
Lint and Wilson, 1992). However, due to its high time complex- 
ity, this method works accurately only for very small networks 
(i.e., those with a few tens of nodes). PReach (Gabr et al., 
2013) computes the exact reachability probabihty based on 



polynomial multiplication. It is significantly faster than the in- 
clusion-exclusion method of Ourfali et al. (2007) for networks 
where there are many paths. However, it does not scale to large 
networks. Thus, the existing solutions cannot be used to study 
entire TRNs, and there is a great need for accurate yet efficient 
methods. 

Contributions. Here, we develop a novel method that computes 
the probabihty that a signal originated at a given source gene can 
reach to a given target gene in a given probabihstic network. 
Unhke existing methods, our solution is both precise (i.e., it 
computes this probability without error) and it scales to large 
networks. Our method follows a divide-and-conquer approach. 
We partition the given probabihstic network into a sequence of 
loosely connected clusters of nodes. On the boundary between 
two such consecutive clusters lies a set of nodes called node sep- 
arators. Any signal which originates from the source node and 
arrives at any node in the latter cluster must visit the node sep- 
arators. Similar to PReach (Gabr et al, 2013), we model the 
given probabihstic network using polynomials. The form of the 
polynomials of our method however differs from that of PReach 
in a way that allows us to collapse the polynomial to very small 
size that is determined by the size (number of interactions) of the 
clusters and the number of nodes in a given boundary. Each term 
in our polynomial evaluates the existence probability of a collec- 
tion of subsets of interactions. In brief, instead of computing the 
reachability probabihty from the given source node to the target 
node, we incrementally compute the reachability probability 
from the source node to each node separator in sequential 
order. That allows us to avoid storing a massive fraction of 
terms of the polynomial (i.e., the terms corresponding to the 
nodes in earher clusters). Our experimental results on real and 
synthetic datasets demonstrate that our method scales to very 
large network sizes while the inclusion-exclusion method 
(Ourfah et al, 2007) and PReach (Gabr et al, 2013) fail. We 
also observe that the reachability profiles provide a valuable re- 
source for characterizing leukemias and differentiating the cen- 
trahty of the genes across different leukemias as well as healthy 
control groups. 

In summary, the key contributions of this work are: 

• We introduce a new quantity for evaluating the state of a 
biological network, the reachability profile. 

• We introduce a novel, fast and scalable method to compute 
the reachabihty profile of large networks, based on polyno- 
mials and polynomial collapsing operators. 

• We demonstrate the usefulness of reachabihty profiles in 

detailed analysis of different types of leukemias. 

The rest of the article is organized as follows. Section 2 de- 
scribes our method. Section 3 presents our experimental results. 
Section 4 concludes with a brief discussion. 



2 METHOD 

In this section we present our method in detail. We first define the essen- 
tial theoretical concepts in Section 2.1. We then present an overview of 
our method in Section 2.2. We discuss how to compute intermediate 
reachability probabilities in Section 2.3. We elaborate on how to partition 
the network in Section 2.4. 
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2.1 Preliminary definitions 

We start by formally defining the probabilistic network concept. 
We provide a list of notations used throughout the article in the 
Supplementary Material. 

Definition 1 (probabilistic network). A probabilistic network is a graph 
Q = {V, E, P), where V is the set of nodes, E is the set of edges and P : E 
(0, 1] is a function that associates to each edge a probability value. 

In our context, each node in V represents a gene, each edge in 
E represents an interaction between two genes and P associates to each 
edge the probability of the existence of the interaction it represents. For 
instance, in Figure 1 (top figure), V= {a,b,c,d} and E= {£'i,£'2,^3,^'4}- 
We assume that each edge exists independent of all other edges. 
This assumption is commonly used in the literature for similar problems 
(Ceol et al., 2010; Szklarczyk et al., 2011). We limit our description 
to directed networks, although undirected networks can be dealt 
with by replacing each undirected edge with two edges in opposite 
directions. 

Given a probabilistic network Q = (V, E, P), we call the deterministic 
network G = {V,E) the maximal deterministic network of Q. In other 
words, the maximal deterministic network is the deterministic network 
in which all possible interactions of Q are present. 

The computational problem we address in this article is: given a prob- 
abiUstic network, Q = (V, E, P), a source node s e V and a target node 
t e V, what is the probability that the t can be reached from si 

Next, we define a graph concept, node separator, which will help us in 
explaining our method. 

Definition 2 (Node Separator). Let G = (V,E) be a deterministic net- 
work and s, t e V be two of its nodes. An s-t node separator of G is a set of 
nodes K <^ V whose removal disconnects t from s in G. 

Figure 2 illustrates this concept. Here, the source node s and the target 
node / are labeled with 1 and 8, respectively. The set of nodes {4, 5} is an 
s-t node separator. We say that a node separator is minimal if none of its 
proper subsets is a node separator. 

A node separator partitions the nodes of that network into three dis- 
joint subsets: 

(1) The left nodes are the nodes that are reachable from the source, but 
the target cannot be reached from any of them without going 
through the node separator (e.g., nodes 1, 2 and 3 in Figure 2, 
for the node separator {4, 5}). 

(2) The node separator itself (e.g., nodes 4 and 5 in Fig. 2). 

(3) The right nodes are the remaining nodes (e.g., nodes 6, 7 and 8 in 
Fig. 2). Notice that these are the nodes from which the target can 
be reached, but they are not reachable from the source without 
passing through the node separator. 

A node separator K also partitions the edges of the given network into 
three subsets: 

(1) The left edges are the edges between the nodes in the union of 
left and separator nodes (e.g., edges ^1,^2,^3,^5 and e^ in Fig. 2). 
We denote the set of left edges with L(K). 

(2) The right edges are the edges between right nodes or from a sep- 
arator node to a right node (e.g., edges e7,£'9,eio,eii in Fig. 2). 

(3) The backward edges are the edges from right nodes to the sep- 
arator nodes or from right nodes to left nodes (edges ^4,^3 in 
Fig. 2). 

Theorem 1. Let G = (V,E) be a deterministic network. Given two nodes, 
s, t e V, let K be an s-t node separator. For any right node u of K, it is 
guaranteed that K is also an s-u node separator. 




Fig. 2. A network with an s-t node separator. The source is node 1 and 
the target is node 8. The dotted rectangle indicates an s-t node separator 



We prove Theorem 1 in the Supplementary Material. 

If a node separator does not yield backward edges, we call it a good 
node separator. We only use good node separators in the rest of the 
article. So, in what follows by node separator we refer to a good node 
separator, unless otherwise specified. Finally we define the concept of 
subset reachability in probabilistic networks. 

Definition 3 (Subset Reachability). Let Q = (V, E, P) be a probabilistic 
network. Let s and t (s, t e V) be source and target nodes in Q. Consider 
two s-t node separators Kj and Kj of Q such that for all nodes u e Kj\Ki, u is 
a right node of Kj. Let S and T be two subsets of Kj and Kj respectively. We 
say that Kj is T-reachable from S if all nodes in T are reachable from at 
least one node in S and none of the nodes in Kj\T is reachable from any 
node in S. We denote the probability that Kj is T-reachable from S by 
p(S,T,Kj). 



2.2 Overview of the method 

Our method works in two steps. 

Step 1. Given a probabilistic network Q^(V, E, P) and source and target 
nodes s and in the first step, we partition Q into a sequence of subnet- 
works that are connected to each other through node separators. In gen- 
eral terms, let us denote the sequence of node separators with Kq, 
Ki, Kc, Kc+i, where Kq = {s} and K^+i = {t}. We choose these 
node separators such that V/<7, for all nodes u e Kj\Ki, w is a right 
node of K^. Following from Definition 3, the problem we solve in this 
article is equivalent to computing p{K{), Kc+\, Kc+\)=p{{s}, [t], {/}). 

Step 2. At this step we compute the reachability probability from s to t. 
More specifically, using this notation above, for any i (0< /< c), we write 
the probability p{{s},T,Ki+i) as 

p{{s],T,K,^,)= J2 P({^},S,Kdp(S,T,K^^,) (1) 

The case z = 0 is a special one. Since Kq contains s, we have T = {s}. 
Thus the probability to reach the source node is 1. Following from 
Equation (1), our algorithm iteratively computes p{Kq,Kc+i,Kc+\) by 
moving from one node separator to the next, starting from Kq. 

The correctness of Equation (1) follows from the definition of node 
separator and Theorem 1. More specifically, in order to reach to any 
node in T c Ki+i, we have to visit at least one node in Kj. The product 
p({s],S,Ki)p(S,T,Ki+i) in Equation (1) is the probability that a signal 
reaches T by visiting all the nodes in S and no other node in Kj- S. 
The summation in this equation enumerates all possible subsets S ^ Kj. 
Thus, it is accumulates the probability of all possible alternative routes 
from s to T defined by all possible subsets S. 

Figure 3 illustrates our method. In this example, the set of edges in E is 
split into three non- overlapping sets using four node separators Kq, Ki, K2 
and K3 where ^0 = {s} and K3 = {t}. These sets are L(Ki), L(K2)\L(Ki), 
and L{Ki)\L{K2). Each of these sets define a subnetwork of Q. Once the 
network is partitioned this way, instead of computing the reachability 



i98 



Large scale analysis of signal reachability 



Ko El Ki E2 K2 E3 Ks 




Fig. 3. A hypothetical network with two disjoint s-t node separators, 
Ki = {2,3} and K2 = {4,5,6}. Source and target nodes are labeled with 
s = 1 and / = 7. For uniformity, we consider Kq ^ {1} and = {7} also 
to be node separators 



probability directly from s to t, we compute it incrementally by advancing 
from one node separator to the next. For the example in Figure 3, we first 
consider the separator Ki, then K2, finally K^. At each node separator, we 
only consider the subnetwork which contains the left edges of that 
separator. 

To understand Equation (1) better, consider the separators Ki and K2 
in Figure 3. There are three possible scenarios to reach to a subset T of 
K2, say T = {4} from s = I. Each of these scenarios corresponds to a 
nonempty subset of Ki = {2, 3}. 

(1) Visit S = {2} and do not visit ^1X5= {3}. This happens with prob- 
ability /7({ 1 } ,{2} ,^iM{2} , {4} ,^2) ■ 

(2) Visit S = {3} and do not visit Ki\S={2]. This happens with prob- 
abiHty/7({l},{3},^iM{3},{4},^2)- 

(3) Visit both nodes in 5* = {2, 3}. This happens with probability 
K{l},{2,3},^iM{2,3},{4},^2). 

The sum of the three probabilities above yields the probability 

pm,{4},K2). 

In Section 2.3, we explain how we compute Equation (1) efficiently for 
/>0 (Step 2). In Section 2.4, we explain how we choose the node separ- 
ators (Step 1). 

2.3 Computing the reachability probability 

In Equation (1), we presented an iterative formula to compute the reach- 
ability probability p({s} ,{t} ,{t}) by splitting the network using cuts 
Kq, Ki, . . . , Kc+i. Although this equation reduces the scale of the problem 
to the subnetworks between consecutive cuts, computing it efficiently still 
remains to be a challenge. Here, we describe how we compute this prob- 
abiUty efficiently yet provably correctly. More specifically, given two con- 
secutive node separators and Ki+i (0</<c) and given p({s},S,Ki) 
for all subsets S c Ki, we discuss how we compute p({s},T,Ki+ 1) for all 
subsets T ^ Ki+i. 

From the definition of left edges, we know that the probability 
p({s},S,Ki) depends only on the edges in L(Ki). This is because L(Ki) 
contains all the edges that can lie on a path from s to any node in Kj. 
Let us denote the set of edges in L{Kj)\L{Kj_\) with Ej for any 0<7 (i.e., 
left edges of Kj which are also right edges of Kj_ 1). Thus, the probability 
p{{s},T,Ki+ 1), depends only on the edges in + 1 when p{{s],S,K^ is given 
for all S. This implies that it is possible to compute the probability 
p{{s],T,Ki+i) by considering only the edges in Ei+i when p{{s},S,Ki) is 
known yS c Ki. Below, we compute this probability by transforming the 
probabilistic network into a collection of polynomials. 

Transformation into polynomial space. Assume that the given probabilistic 
network, Q^{V, E, P), contains n edges and m nodes, denoted with 
E = {€1,62, . . . e,J and V = {vi,V2, . . . v^}, respectively. As the first step 
of the transformation, we associate to each edge a polynomial called 
the edge polynomial. More precisely, for edge ei e E, let pi = P(ei) and 
qi = I- Pi denote the existence and absence probability of respectively. 



We define the edge polynomial of as the first degree polynomial of two 
variables, Xi and yt, Fi{xi,y^ = pjXi + qiyi. 

Consider a subset E' of the edges in E. We define the edge aggregation 
polynomial for E\ denoted with F{E'), as the product of all the edge 
polynomials associated with the edges in E'\ 

F{E')=WFi(x,,y,)=Y.WP'^'W Uyt- (2) 

eieE' £^E' e/eS ejeE'\£ 

Notice that each term in the summation above corresponds to one of 
the possible deterministic configurations for the network topology. The 
coefficient of the term YleieS ^iY[e eE\£ yj probability of obser- 

ving all the edges in S and not observing any edge in E\S. To understand 
this better, consider the network in Figure 1 (network on the top). In the 
edge aggregation polynomial of this network, the term corres- 
ponds to the deterministic instance where only edges ^3 and ^4 are present 
(i.e., bottom left network in Fig. 1). The coefficient of this term is q\q2P3P4 
which is the probability of observing that network instance. 

Reachability in polynomial space. As we explain in Equation (2), the terms 
of the edge aggregation polynomial represent different deterministic net- 
work configurations. Thus, the probability p{{s],T,Ki+i} is equal to the 
sum of the coefficients of a specific subset of the polynomial terms: The 
terms which yield a topology where Ki+i is T-reachable from {s}. 

At this point, the polynomial transformation seemingly makes the 
reachability problem as complicated as exhaustively enumerating all net- 
work topologies. This is because, (i) the edge aggregation polynomial has 
as many terms as the number of network topologies; and (ii) finding the 
subset of polynomial terms which yield the desired topologies will incur 
additional computational cost. Below, we build a novel algebra on the 
edge aggregation polynomial to compute this value by enumerating only 
a tiny fraction of the polynomial terms. 

Algorithm 1 presents a pseudocode that describes our algorithm for 
constructing the polynomial needed to compute p{{s},T,Ki+i). The algo- 
rithm takes the existing edge aggregation polynomial for the edges in 
L{K^ as input. At each iteration it grows that polynomial by aggregating 
it with the edge polynomial of a new edge in Ei+ 1 (Step 2). It then reduces 
the size of the resulting polynomial by collapsing it (Step 3). Briefly, the 
collapsation step merges all terms which correspond to configurations in 
which Kj+i is T-reachable from s, for each possible subset T Ki+i, 
into a single term by replacing the variables in these terms with a 
single variable denoted with zt- Thus, the coefficient of is the 
sum of the coefficients of the original terms that were collapsed. In the 
rest of this section, we elaborate on these steps, particularly the collapsa- 
tion step. 



Algorithm 1 Compute the edge aggregation polynomial for L(Ki+i) 

Require: Probabilistic graph Q = (V, E, P) 
Require: Node separators Ki and Ki+i 
Require: Edge aggregation polynomial F' = F(L(K^). 
1: for all ej e Ei+i do 

2: Aggregate edge polynomial of ej as F' = F'x Fj{xj,yj) 
3: Collapse F' 
4: end for 



We start by introducing some notation which will simplify our poly- 
nomial algebra below. For a subset of edges S <^ E,wq denote the set of 
indices of the edges in 8 by Ind{S). For instance, for S = {e2, e^, e^], we 
have//i4£') = {2,3, 8} 

Let us denote the subset of edges of Ei+ 1 which have been multiplied 
into the edge aggregation polynomial so far with S c Ei+\ and its set of 
indices with 0 = Ind(S). 

Following from Equation (2), since S and L(Ki) are disjoint, we can 
write the edge aggregation polynomial of the edge set 8 U L(Ki) as 



199 



A.Todor et al. 



F(S)F(L(Ki)). To simplify our notation of F(S), for all / c 0, we denote 
H/GZ-^i^; and Ylie@\iyi with xi and y^^j, respectively. We denote the coef- 
ficient of xiy^^i with a J. Thus, we can write the first polynomial as 

^(^) = E/c0«/xiy0\i. 

For each node separator Kj, we define a unique collapsing operator 
and denote it with p/(). This is a linear operator; it acts on the terms of the 
given edge aggregation polynomial for the edges in L(Ki) independently. 
Briefly, the collapsed polynomial contains a new variable, z^, for each 
subset S of Ki. The form of this polynomial is Pi(F(L(Ki))) = J2s<zKi ^s^s- 
In this representation, zs corresponds to the case where Ki is .S-reachable 
from ^0 (i-e., the original source node), and the coefficient is the 
probability of observing that case. In other words is equal to 
p({s},S,Ki) in Equation (1). We explain how this operator works and 
how we compute it in detail later in this section. For the moment, 
assume that we have already applied it for the edge set L(K^. Therefore 
we replace the polynomial, F{L(Kj)) in the product F{8)F(L(Kj)) with its 
collapsed version, denoted by piF{L{K^)). 

After multiplying the first polynomial and the collapsed version of the 
second polynomial, we get 

F{8)piF{L{Ki))) = J2 o^i^iJeM E ^s^s- (3) 

/C0 S^Ki 

Since this product includes edge polynomials from the edge set S, we 
further reduce its size by applying the collapsing operator on it 

and thus obtain Pi+^{F{S)Pi{F{L{Ki)))). 

Next, we explain how the collapsing operator works. Given two nodes 
u,v e V, let TT be a path from w to v in the maximal deterministic network 
G = (V,E) of the given probabilistic network. Here, by path we mean the 
set of edges traversed to reach from w to v. Let / be a subset of indices, 
I ^ {I, ... ,n}. We define two set indicator functions Xu,vO and cOu^^O for 
the node pair (u, v). The first one takes the value Xw,vW 1 if th^^ is a 
path TT from w to v such that Ind{7i:) c / and 0 otherwise. For instance, in 
Figure 2, xi,8({l?2,5,6,7,10,l 1}) = 1. This is because {e2,e5,e6?^7?^io} forms 
a path from 1 (source) to 8 (target) and its set of indices {2, 5, 6, 7, 10} is a 
subset of the input set {1, 2, 5, 6, 7, 10, 1 1}. Similarly, the second indicator 
function takes the value C0u,vi^ = 1 if there is a minimal u- v cut k such 
that Ind(K) c / and 0 otherwise. For example, a;i^8({2,3,4,5}) = 1, because 
{^3,^5} forms a minimal cut between nodes 1 and 8 and its set of indices 
{3, 5} is a subset of input set {2, 3, 4, 5}. 

Next, we extend the definitions of the set indicator functions x and co 
to multiple source nodes. The extended function Xs,v(^ evaluates to 1 if 
there is a path n from at least one node w in 5* to v such that Ind{n) c / 
and 0 otherwise. Similarly, cosA^ evaluates to 1 if for all nodes u e S 
there is at least a minimal u- v cut k such that Ind(K) c / and 0 otherwise. 
Formally, we compute these functions as 

Xs,uW = 1 - - and ajs,uW = U <^s,uW (4) 

seS seS 

Next, we formalize T-reachability of the node separator Ki+ 1. For this 
purpose, we define a new set indicator function Cs,t{) which evaluates to 
1 only if Kj+ 1 is T- reachable from S. Otherwise, it evaluates to 0. We 
compute this function as 

CsAr)-Y[^sAn n ^s,vmr). (5) 

ueT veKi+i\T 

We prove the correctness of Equations (4) and (5) in the 
Supplementary Materials. 

Now we are ready to put all the pieces together and compute the 
collapsing operator p,+ i. Recall that each term of the given edge aggre- 
gation polynomial indicates a deterministic subnetwork topology for the 
edges in S, combined with all deterministic topologies of the edges in 
L(Ki) in which Ki is 5*- reachable from Kq, for every S <^ Ki. If that com- 
bination ensures that Ki+i is T-reachable from Kq, then the collapsing 
operator p/+i replaces all the variables of that term with zt- More spe- 
cifically, consider a term in Equation (3) after the product has been 



expanded, in the form yis^iy@\i^s, where y^s^^il^s- We compute the 
collapsing operator p/+i() on this term as 

Pi+i(yi,s^iy@\izs) = yi,s J2 ^sM^t- 

+ ri.s{ n (1-C5,r«) x,ye\,zs 

The collapsing operator Pi+ 1() [see Equation (6)] transforms each term 
of the polynomial into a single term. The resulting term either contains 
the variable zt, where T c Ki^\, or remains unchanged. This is because 
Cs,T either takes the value 0 or 1. Thus, p/+ 1() leaves the term unchanged 
only if Csj = ^ for all T. When, Csj=\ for some T c the coeffi- 
cient of Zs becomes 0. It returns Yi,s^t in this case. Furthermore, from 
Equation (5), we know that if Cs,t= 1, then for all Ti^T (T c Ki+i), 
Cs,T' = 0. Thus, the function p,+ 1() returns no other term containing vari- 
able Zt- 

Now suppose that a term has collapsed to zt and a new edge ej is 
added in Step 2 of Algorithm 1 . From a polynomial point of view, the zt 
variable will be multiplied with xj and yj, respectively, resulting in two 
new terms. From the graph reachability point of view, we know that the 
edges added prior to ej already ensure T- reachability, so ej does not make 
any differece: both its presence and its absence lead to reachable graph 
configurations. In the polynomial, the coeficients of ztXj and zxyj have to 
be added together to obtain the reachability probability. To take advan- 
tage of this observation, we introduce a special multiplication rule for the 
Zt variables: both ztXj and zryj are replaced with zt, for all ej e Ei+ 1, so 
that their coefficients are added together. 

The collapsing operator is very powerful as it ensures that the size of 
the edge aggregation polynomial never exceeds 2'^''^'^'+^' in the worst 
case (i.e., when the indicator function Cs,tO always returns 0 until the last 
edge in L{Ki+ 1) is aggregated). More importantly, it guarantees to reduce 
the polynomial size down to 2'^'+^' once the edges in L(Ki+i) are all 
aggregated. This is a significant improvement as without the collapsation 
function, the size of the edge aggregation polynomial 2'^*^^'+^^' after con- 
sidering Ki+ 1 and it goes up to 2'^' after including all the edges. 

So what is the reachability probability? After all the edges in Ei+ 1 have 
been added, all the terms will collapse, and the polynomial will be 
Pi+i{F{L{Ki+i))) = J^T^Ki+i yr^T- When Kc+\ = {/} is reached, the poly- 
nomial will have only two terms: y{f^^Z{t} + y^zt^. The coefficient y^t} is 
equal to the probability that the target node is reachable from the 
source node. We prove the correctness of our method in the 
Supplementary Material. 

2.4 How to choose node separators 

Depending on the topology of the maximal deterministic network there 
can be many alternative sequences of node separators between the source 
and target nodes. Regardless of how we choose the node separators, our 
method guarantees to return the correct result. The node separator choice 
however can affect the size of the intermediate polynomials and thus the 
running time of our method in two ways, (i) Ideally, each node separator 
Ki should contain a small number of nodes as it will produce 2'^'' vari- 
ables of the form zs- (ii) Each consecutive node separators should contain 
a small number of edges between them (i.e., Ei should be small). This is 
because, in the worst case, they yield 2'^'' terms. Finding an optimal 
sequence of node separators that minimizes the overall computation 
time is in itself an intriguing area worth investigating. The right balance 
between the separator size, the size of the edge sets between the separators 
and the amount of computation we are willing to spend on finding the 
solution is hard to find. Here, we use a greedy approach to find good 
node separators. 

We consider the first node separator {Kq) to be the source node itself. 
We determine the next node separator from the current one by 
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considering all nodes that are one edge further from the current node 
separator. The set of nodes identified in this way is a minimal node sep- 
arator, but it is not necessarily good, because it may contain nodes with 
incident backward edges — see Section 2.1. To make it good, we first 
identify the nodes that have incident backward edges and replace each 
of them with all the nodes that are reachable from them in one hop. Thus 
we advance the node separator toward the target keeping it minimal, and 
stop as soon as we encounter a good minimal node separator. This way, 
we aim to keep the size of Ej small. We repeat this process to select more 
good node separators until we reach the target. 

3 RESULTS 

In this section we experimentally evaluate our method. 
Section 3.1 presents the datasets and the experimental setup. 
Section 3.2 examines the running time of our algorithm. 
Section 3.3 presents the reachabihty profiles obtained with our 
method. Section 3.4 evaluates gene centrality based on the reach- 
abihty profiles. Section 3.5 analyses the stabihty of the human 
TRN. 

3.1 Datasets and implementation details 

We evaluate our method using both synthetic and real biological 
networks. 

Synthetic dataset. We generated the synthetic network dataset 
using the Barabasi-Albert random network model (Barabasi and 
Albert, 1999). We chose this model because it is the de facto 
standard for the scale-free networks, which best describe most 
biological networks (Jeong et aL, 2000; Tod or et aL, 2012; Yook 
et aL, 2004). We created six sets of random networks. In each set, 
we created 10 networks with the same number of nodes: 50, 100, 
1 50, 200, 250 and 300, respectively. The number of edges is twice 
the number of nodes in each network. 

Real dataset. For experimentation on real biological networks 
we used the human regulatory network of Ger stein et al. (2012). 
From this network, we selected only the reliable interactions by 
taking the intersection with those present in the DIP database 
(Xenarios et al, 2002). The resulting network has 130 nodes and 
172 edges. To assess the interaction confidence for each edge in 
this intersection, we used the logistic regression method used by 
Sharan et al. (2002). This strategy is used often in the literature to 
compute interaction confidence (Bader et al, 2004; Ourfah et al, 
2007; Sharan et al, 2002; Shlomi et al, 2006). We obtained the 
gene expression data of 575 leukemia patients from Zhang et al. 
(2012). We obtained control gene expression data in early pro- 
genitor cells from Laurenti et al. (2013). Both control and leuke- 
mia expression datasets are normalized using quantile 
normalization (Amaratunga and Cabrera, 2001). Each leukemia 
sample in our dataset belongs to one of eight different subtypes 
of leukemia: hyperdiploid, TCF3-PBX1, ETV6-RUNX1, MLL, 
Ph, Hypo, T-ALL and Other, or to non-leukemia sample types 
CD10CD19 and CD34. We do not include samples from the last 
two categories in our experiments, since they contain only four 
samples each. We trained eight different logistic regression 
models, one for each leukemia subtype to compute interaction 
probabihties for each subtype separately. Also, we classified the 
early progenitor cell samples into three categories: primitive 
(hematopoietic stem cells), lymphoid (FTP, MLP, ProB and 
B_NKpre) and myeloid (the rest of the samples). We trained a 
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Fig. 4. Average running time of our method on Barabasi-Albert net- 
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different logistic regression model for each type. Thus, we ob- 
tained different probability values for the edges of the human 
regulatory network, depending on the cancer or control group 
subtype in which the gene expression levels were measured. This 
in turn results in different reachability probabihties. We identi- 
fied all the source and all the target genes in our network using 
the hierarchical decomposition obtained by HIDFN (Gulsoy 
et al.,l^\T). This resulted in 9 source genes and 88 target genes. 

We used C + + , Matlab and R for implementation. We ran 
our experiments on an AMD Opteron processor with 256 GB of 
memory and 1.9 GHz speed. 



3.2 Evaluation of the running time 

In order to evaluate the performance of our method systematic- 
ally, we ran it on the synthetic networks of different sizes. We 
measured the running time for each synthetic network and each 
source-target pair. We have taken each node, in turn, as a source 
an then as a target. Thus, computing the reachability profile for 
the largest network size requires 300 x 299 = 89700 reachabihty 
probability computations per network. In total, we computed 
the reachability profile for 10 x 6 = 60 networks, for a total of 
2264500 reachability probabilities. In Figure 4, we report the 
average running time to compute the reachability probability 
for one source-target pair for each set of networks. We report 
the average running time over all networks in the set and over all 
source-target pairs. 

The figure shows that the running time of our method in a 
scale- free network grows at most linearly in terms of number of 
nodes. Even for networks as large as 300 nodes and 600 edges, 
the average running time of our method per source-target pair 
remains in milhseconds. This small running time allows us to com- 
pute the entire reachability profiles in practical time for a large 
number of networks, which was not possible before. 

For comparison, the inclusion/exclusion method (Ourfah 
et al, 2007) and PReach (Gabr et al. 2013) fail to complete exe- 
cution on the same dataset because they exhausted the 256 GB of 
memory available in the system even for a single source-target 
node pair of the smallest network in our dataset. 

For the real dataset investigated in this article, we computed 
1 1 reachability profiles, one for each leukemia or control group 
subtype. For each subtype, we computed 9 x 88 reachability 
probabilities (for 9 sources and 88 target nodes), thus 8712 prob- 
abilities in total. Our method computed each of these 
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Fig. 5. Reachability profiles in the human regulatory network. Each row 
represents a cancer type or a control group. Each column represents a 
source-target pair. The intensity of each cell represents the reachability 
probability for that source-target pair — lighter color means higher 
probability 



probabilities in only 2.5 s on the average. Both PReach and the 
inclusion-exclusion method fail to scale to this network size. 

3.3 Reachability profiles in the human TRN 

For each leukemia or control network, we computed the reach- 
ability probability for each pair of source-target nodes. We call 
this the reachabihty profile of the network. In Figure 5 we show 
the reachability profiles for all leukemia subtypes and control 
groups in a heat map. Each row in the figure represents a leuke- 
mia subtype or a control group, and each column represents a 
source-target pair. The color intensity at a location represents 
the reachabihty probability for that pair. We applied hierarchical 
clustering on both dimensions based on the reachabihty profiles. 
Hierarchical clustering correctly clusters the control groups sub- 
types together, as well as all the leukemias. This shows that the 
reachability profile can distinguish between healthy and leukemia 
cases. 

Source and target gene groups that show a noticeable gap 
between their reachability probabilities in control versus leuke- 
mia cases include SPIl, POU2F2 as sources and TOPBPl, 
TFDPl, TFDP2, HDACl, CDK8, REL, RELA and NFKB2 
as targets. While these sources and targets have low a reachabil- 
ity probability for control groups, they exhibit a higher range in 
leukemia subtypes. 

Our findings resonate with earher observations. Our method 
clusters the hyperdiploid and the ETV6-RUNX1 subtypes to- 
gether, while in (Zhang et al., 2012), Supplementary Figure 
S22, a significant number of genes exhibit similar expression 
levels in these subtypes. They are frequently studied together, 
as they are both related to a favorable prognosis in children 
(Liang et al., 2010; Paulsson et al., 2010). On the contrary, the 
Hypo subtype, which is least similar to Hyperdiploid and ETV6- 
RUNXl in our results, is associated with poor outcome 
(Holmfeldt et al., 2013). 

To further appreciate the value added by the reachabihty pro- 
files to our results, we performed another experiment based 
solely on gene expression data, without taking the regulatory 
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network into account. In this experiment, we clustered the gene 
expression samples using /c-means clustering. We set = 11, as 
there are totally 11 subtypes in our dataset. Then, within each 
cluster, we examined the distribution of each leukemia type. The 
results are shown in Figure 6. Our results demonstrate that, with 
the exception of cluster 10, consisting primarily of T-ALL sam- 
ples, all the clusters are a heterogeneous mix and do not have a 
definitive dominant leukemia type. Although one cluster consists 
only of control samples, the control subtypes are mixed together. 
Furthermore, the myeloid subtype samples are spread out 
through the rest of the clusters. We conclude that clustering 
based on gene expression alone is insufficient for classifying leuke- 
mia types. 

In light of these experimental observations, reachability profiles 
prove to be a reliable and valuable tool for assessing the viability of 
TRNs working as a whole. 

3.4 Gene centrality using reachability profiles 

We further illustrate the usefulness of reachabihty profiles by 
analysing the centrality of genes based on their contribution to- 
ward the reachability profile (Gabr and Kahveci, 2013). For this 
experiment, we compare the reachability profiles for the original 
network with the reachability profiles obtained by eliminating 
one gene from the network. Thus, for each gene, we compute 
its centrality by comparing the reachability profile for the ori- 
ginal network with the reachability profile obtained when the 
gene is missing. For a given gene g, whose centrality is under 
consideration, and a given source-target pair, the difference in 
reachability probabihty can be seen as the probabihty that the 
source-target pair is indispensable for connecting the source to 
the target; in other words, {g] is a node separator. Then the sum 
of this value over all source-target pairs is the average number of 
source-target pairs for which g is indispensable. To formalize 
this description, let us denote the set of source and target genes 
with S and T, respectively. We also denote the probabihty that 
gene t is reachable from gene s in the original network with p{s,t) 
and the same probabihty for the network where gene g is 
removed with p-g{s, t). The centrahty of gene g is defined as 

Uses T.teTP{sJ)-p-g{s, t). 

Figure 7 plots the centrality values for each leukemia type and 
each gene. We excluded from the plot the genes having centrahty 
smaller than 1. As expected, only a few genes have a high cen- 
trality, which is a characteristic of scale-free networks. We also 
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Fig. 7. Centrality of genes in the human regulatory network for different 
leukemia subtypes. Light color denotes high centrality. We only show the 
genes with centrality value >1 for at least one network 



performed hierarchical clustering of the leukemia subtypes and 
of the genes based on their centrality. We observe that the most 
similar subtypes of leukemias are T-ALL and Ph. The Ph sub- 
type is a chromosomal abnormality resulting from the same 
translocation found in ALL (Talpaz et aL, 2006). The least simi- 
lar to the first two is Hypo, like in the reachabihty profiles ex- 
periment. TP53 and RBI are two of the most central genes 
identified by our method. They are both characterized by alter- 
ations in Hypodiploid ALL (Holmfeldt et aL, 2013). We see that 
the most central gene is E2F1, which a transcription factor 
known to have a crucial role in cell cycle and tumor suppression 
(Neuman et aL, 1996). Thus, malfunctioning of this gene severely 
affects many pathways in the regulatory network. Likewise, the 
following two reachable genes, MYC and TBP are known hubs 
regulating important functions. MYC is involved in cell prolif- 
eration and its persistent expression is common to many cancers 
(Nesbit et aL, 1999), while TBP is related to RNA polymerase II, 
an essential element of DNA transcription initiation (Kornberg, 
2007). Among the top genes we identified based on their central- 
ity is also EP300, a histone -modifying gene which was reported 
to inactivate lesions disrupting hematopoietic development in 
ETP ALL (Zhang et aL, 2012). 

3.5 Assessment of network stability 

Beside characterization of single genes using centrality, we also 
performed and experiment to characterize the entire human 
TRN. In this experiment, we assess the level of stabihty of 
each of the studied networks. We measure the stabihty of the 
network as the average change in reachability probabihty when 
edge probabilities are randomly perturbed. 

Consider the given probabihstic network G = (V, E, P) and the 
sets of source and target nodes S and T. Also consider a param- 
eter 8 that denotes the maximum change in edge probabihties. 
We defined a perturbed edge probabihty function [0,1] 
that, for each edge e e E, returns a value drawn uniformly at 
random from the range P(e) ± 5 Pi [0, 1]. We constructed a per- 
turbed network = (V, E, P^). For every pair of s e S and 
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t e T, WQ measured the reachabihty probabihty in G as p(s,t), 
as well as that in G^ as p^{s,t). We then computed the absolute 
difference \p^{s, t) — p{s, t)\. We repeated this experiment 
20 times. We computed the average of the resulting values over 
diW s e S and ^ e T, as well as over the 20 experiments. 

We plotted the results for different values of <5 for the leukemia 
networks as well as for the control networks. Figure 8 shows the 
results. The first observation we draw from the figure is that the 
change in reachability probability for all networks is linear. We 
also observe that even by perturbing the edge probabihties in the 
range of ±0.3, the change in reachability probability does not 
exceed 0. 1 . From these two observations we can judge transcrip- 
tion-factor regulation in homo sapiens as highly stable and in- 
sensitive to random perturbation. This conclusion holds for both 
healthy people and leukemia patients. However, we also observe 
that the gap between the networks is not constant; it shghtly 
increases with the increase of perturbation level. At the extreme 
case of ±0.3, the gap is maximum. There, T-ALL show the most 
sensitivity to this level of perturbation, while Hypo and MLL 
show the least sensitivity. 



4 CONCLUSION 

In this article we have characterized different types of leukemias 
based on the state of the regulatory networks in patients affected 
by this disease. The state is evaluated through reachability pro- 
files. The reachability profile describes the ability of regulator 
genes to affect the transcription factors. For this we developed 
a fast, exact method for computing the probabihty for a signal to 
reach from a source node to a destination node in a probabilistic 
network. The rigorous mathematical apparatus, which involves 
polynomials and polynomial collapsing operators, aUows fast 
execution time, demonstrated in the performance evaluation ex- 
periments. Valuable uses of the reachability profiles illustrated in 
this article include leukemia subtype classification, gene central- 
ity evaluation and regulatory network stabihty analysis. All these 
are valuable tools for evaluating the viability of the TRN under 
varying conditions as a whole, not just limited to individual gene 
expressions levels. An interesting parallel can be drawn between 
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our solution and Bayesian Network inference. However, as we 
mentioned in Section 1, this alternative is limited to acyclic net- 
works. We see a possible application of the Bayesian Network 
alternative in combination with the reduction of strongly con- 
nected components to single nodes, but this solution deserves a 
careful examination by itself. 
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